Learning Causal Structure from Overlapping Variable Sets

Authors

  • Sofia Triantafillou
  • Ioannis Tsamardinos
  • Ioannis G. Tollis
Abstract

Modern data-analysis methods are typically applicable to a single dataset. In particular, they cannot integratively analyze datasets containing different, but overlapping, sets of variables. We show that by employing causal models instead of models based on the concept of association alone, integrative analysis makes it possible to draw additional interesting inferences that independent analysis of each dataset cannot. Specifically, we assume that all datasets are generated by a single overarching causal model representable by a Maximal Ancestral Graph; Maximal Ancestral Graphs are a class of graphical independence models designed to model marginal distributions and cope with causal insufficiency (latent confounding variables). We rigorously define the problem of identifying one or all causal models that simultaneously fit the available data. We propose a novel algorithm, FCM, that converts this problem to a SAT formula whose solutions correspond to all plausible causal models. We also introduce a new kind of graphical model, the Pairwise Causal Graph (PCG), that succinctly summarizes all pairwise causal relations among the variables. Based on FCM, we propose cSAT+, an algorithm that outputs the PCG when given a set of datasets, and prove that this algorithm is sound and complete in the absence of statistical errors. In our empirical evaluation on simulated datasets, we show that integrative analysis using cSAT+ makes more sound causal inferences than analyzing the datasets in isolation. Examples of interesting inferences include the induction of the absence or the presence of some kind of causal relation between two variables never measured together. The latter observation has significant ramifications for data analysis, as it implies that additional causal relations may be inferred from already available datasets, without further studies. We also show empirically that cSAT+ outperforms ION, the first algorithm solving a similar but more general problem, by two orders of magnitude and scales to larger problems than ION.
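The idea of collecting every causal model that fits all datasets at once, and then reading off what those models agree on, can be illustrated with a toy brute-force sketch. This is not FCM or cSAT+: it assumes plain DAGs over three variables with no latent confounders (the paper works with Maximal Ancestral Graphs), it enumerates graphs exhaustively instead of building a SAT formula, and the two independence facts in `constraints` are hypothetical inputs chosen for illustration. It also assumes faithfulness, so that two variables are marginally dependent exactly when they share an ancestor.

```python
from itertools import product

PAIRS = [("X", "Y"), ("X", "Z"), ("Y", "Z")]

def ancestors(edges, node):
    """Return `node` plus every variable with a directed path into it."""
    found, frontier = {node}, [node]
    while frontier:
        child = frontier.pop()
        for parent, target in edges:
            if target == child and parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found

def is_acyclic(edges):
    # An edge parent -> child closes a cycle iff child is already an ancestor of parent.
    return all(child not in ancestors(edges, parent) for parent, child in edges)

def all_dags():
    """Enumerate every DAG over X, Y, Z (each pair: no edge, ->, or <-)."""
    for choice in product((None, "->", "<-"), repeat=len(PAIRS)):
        edges = set()
        for (a, b), direction in zip(PAIRS, choice):
            if direction == "->":
                edges.add((a, b))
            elif direction == "<-":
                edges.add((b, a))
        if is_acyclic(edges):
            yield frozenset(edges)

def dependent(edges, a, b):
    # Marginal d-connection in a DAG: the two nodes share an ancestor
    # (each node counts as its own ancestor).
    return bool(ancestors(edges, a) & ancestors(edges, b))

# Hypothetical test results: dataset 1 measures {X, Y}, dataset 2 measures {Y, Z}.
constraints = [
    ("X", "Y", True),   # X and Y found dependent in dataset 1
    ("Y", "Z", False),  # Y and Z found independent in dataset 2
]

consistent = [
    g for g in all_dags()
    if all(dependent(g, a, b) == dep for a, b, dep in constraints)
]

# Pairwise summary for X and Z, which were never measured together.
x_causes_z_somewhere = any("X" in ancestors(g, "Z") for g in consistent)
z_causes_x_somewhere = any("Z" in ancestors(g, "X") for g in consistent)

for g in consistent:
    print(sorted(g))
print("X is an ancestor of Z in some consistent model:", x_causes_z_somewhere)
print("Z is an ancestor of X in some consistent model:", z_causes_x_somewhere)
```

Under these assumptions, no consistent model makes X an ancestor of Z, even though X and Z never appear in the same dataset. This is the kind of pairwise conclusion that the abstract says a PCG summarizes.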

Similar resources

Constraint-based causal discovery from multiple interventions over overlapping variable sets

Scientific practice typically involves repeatedly studying a system, each time trying to unravel a different perspective. In each study, the scientist may take measurements under different experimental conditions (interventions, manipulations, perturbations) and measure different sets of quantities (variables). The result is a collection of heterogeneous data sets coming from different data dis...

Discussion of "Learning Equivalence Classes of Acyclic Models with Latent and Selection Variables from Multiple Datasets with Overlapping Variables"

In automated causal discovery, the constraint-based approach seeks to learn an (equivalence) class of causal structures (with possibly latent variables and/or selection variables) that are compatible (according to some assumptions, usually the causal Markov and faithfulness assumptions) with the conditional dependence and independence relations found in data. In the paper under discussion, Till...

Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies

We present methods able to predict the presence and strength of conditional and unconditional dependencies (correlations) between two variables Y and Z never jointly measured on the same samples, based on multiple data sets measuring a set of common variables. The algorithms are specializations of prior work on learning causal structures from overlapping variable sets. This problem has also bee...

Integrating Locally Learned Causal Structures with Overlapping Variables

In many domains, data are distributed among datasets that share only some variables; other recorded variables may occur in only one dataset. While there are asymptotically correct, informative algorithms for discovering causal relationships from a single dataset, even with missing values and hidden variables, there have been no such reliable procedures for distributed data with overlapping vari...

Learning Bayesian Network Structure using Markov Blanket in K2 Algorithm

A Bayesian network is a graphical model that represents a set of random variables and their causal relationships via a Directed Acyclic Graph (DAG). There are basically two methods used for learning Bayesian networks: parameter learning and structure learning. One of the most effective structure-learning methods is the K2 algorithm. Because the performance of the K2 algorithm depends on node...

Publication date: 2010